K-means Clustering with Feature Hashing

نویسنده

  • Hajime Senuma
چکیده

One of the major problems of K-means is that one must use dense vectors for its centroids, and therefore it is infeasible to store such huge vectors in memory when the feature space is high-dimensional. We address this issue by using feature hashing (Weinberger et al., 2009), a dimension-reduction technique, which can reduce the size of dense vectors while retaining sparsity of sparse vectors. Our analysis gives theoretical motivation and justification for applying feature hashing to Kmeans, by showing how much will the objective ofK-means be (additively) distorted. Furthermore, to empirically verify our method, we experimented on a document clustering task.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Randomized Dimensionality Reduction for k-Means Clustering

We study the topic of dimensionality reduction for k-means clustering. Dimensionality reduction encompasses the union of two approaches: 1) feature selection and 2) feature extraction. A feature selection-based algorithm for k-means clustering selects a small subset of the input features and then applies k-means clustering on the selected features. A feature extraction-based algorithm for k-mea...

متن کامل

Locality sensitive hashing: A comparison of hash function types and querying mechanisms

It is well known that high-dimensional nearest-neighbor retrieval is very expensive. Dramatic performance gains are obtained using approximate search schemes, such as the popular Locality-Sensitive Hashing (LSH). Several extensions have been proposed to address the limitations of this algorithm, in particular, by choosing more appropriate hash functions to better partition the vector space. All...

متن کامل

Adaptive Mean Shift Based Clustering in High Dimensions

Feature space analysis is the main module in many computer vision tasks. The most popular technique, k-means clustering, however, has two inherent limitations: the clusters are constrained to be spherically symmetric and their number has to be known a priori. In nonparametric clustering methods, like the one based on mean shift, these limitations are eliminated but the amount of computation bec...

متن کامل

Image authentication using LBP-based perceptual image hashing

Feature extraction is a main step in all perceptual image hashing schemes in which robust features will led to better results in perceptual robustness. Simplicity, discriminative power, computational efficiency and robustness to illumination changes are counted as distinguished properties of Local Binary Pattern features. In this paper, we investigate the use of local binary patterns for percep...

متن کامل

Comparing k-means clusters on parallel Persian-English corpus

This paper compares clusters of aligned Persian and English texts obtained from k-means method. Text clustering has many applications in various fields of natural language processing. So far, much English documents clustering research has been accomplished. Now this question arises, are the results of them extendable to other languages? Since the goal of document clustering is grouping of docum...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2011